Abstract. We present an I/O-efficient algorithm for computing similarity
joins based on locality-sensitive hashing (LSH). In contrast to the filtering
methods commonly suggested, our method has provable sub-quadratic dependency
on the data size. Further, in contrast to straightforward implementations of
known LSH-based algorithms on external memory, our approach is able to take
significant advantage of the available internal memory: whereas the time
complexity of classical algorithms includes a factor of $N^{\rho}$, where
$\rho$ is a parameter of the LSH used, the I/O complexity of our algorithm
merely includes a factor $(N/M)^{\rho}$, where $N$ is the data size and $M$
is the size of internal memory. Our algorithm is randomized and outputs the
correct result with high probability. It is a simple, recursive,
cache-oblivious procedure, and we believe that it will also be useful in
other computational settings such as parallel computation.


Keywords: Similarity join; locality sensitive hashing; cache aware; cache oblivious

1 Introduction

The ability to handle noisy or imprecise data is becoming increasingly important
in computing. In database settings this kind of capability is often achieved using
similarity join primitives that replace equality predicates with a condition on
similarity. To make this more precise, consider a space $U$ and a distance
function $d : U \times U \to \mathbb{R}$. The similarity join of sets
$R, S \subseteq U$ is the following: given a radius $r$, compute the set
$R \bowtie_{\le r} S = \{(x,y) \in R \times S \mid d(x,y) \le r\}$.
This problem occurs in numerous applications, such as web deduplication
[3,13,19], document clustering [4], and data cleaning [2,6]. As such
applications arise in large-scale datasets, the problem of scaling up
similarity joins for different metric distances is becoming more important
and more challenging.
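The definition above can be made concrete with a brute-force baseline. The following is a minimal sketch, assuming points are equal-length strings compared under Hamming distance; the names `hamming` and `similarity_join` are illustrative and not part of the paper. It performs all $|R| \cdot |S|$ comparisons, which is exactly the quadratic behaviour the rest of the paper aims to avoid.

```python
from itertools import product


def hamming(x, y):
    """Hamming distance between two equal-length sequences."""
    return sum(a != b for a, b in zip(x, y))


def similarity_join(R, S, r, d=hamming):
    """Return all pairs (x, y) in R x S with d(x, y) <= r, by exhaustive comparison."""
    return [(x, y) for x, y in product(R, S) if d(x, y) <= r]


R = ["0000", "1111"]
S = ["0001", "1110", "0110"]
pairs = similarity_join(R, S, r=1)
```

With radius 1 only the pairs at distance exactly 1 qualify here; "0110" is at distance 2 from both points of `R` and is therefore not joined with either.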

Many known similarity join techniques (e.g., prefix filtering [2,6], positional
filtering [19], inverted index-based filtering [3]) are based on filtering techniques
that often, but not always, succeed in reducing computational costs. If we let
$N = |R| + |S|$, these techniques generally require $\Omega(N^2)$ comparisons
for worst-case data. Another approach is locality-sensitive hashing (LSH),
where candidate output pairs are generated using collisions of carefully
chosen hash functions. LSH is defined as follows.


Definition 1. Fix a distance function $d : U \times U \to \mathbb{R}$. For
positive reals $r, c, p_1, p_2$ with $p_1 > p_2$ and $c > 1$, a family of
functions $\mathcal{H}$ is $(r, cr, p_1, p_2)$-sensitive if for uniformly
chosen $h \in \mathcal{H}$ and all $x, y \in U$:

— If $d(x,y) \le r$ then $\Pr[h(x) = h(y)] \ge p_1$;

— If $d(x,y) > cr$ then $\Pr[h(x) = h(y)] \le p_2$.

We say that $\mathcal{H}$ is monotonic if $\Pr[h(x) = h(y)]$ is a
non-increasing function of the distance $d(x, y)$. We also say that
$\mathcal{H}$ uses space $s$ if a function $h \in \mathcal{H}$ can be stored
and evaluated using space $s$.
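As an illustration of Definition 1, the sketch below uses bit sampling for Hamming distance, the classical family of Indyk and Motwani [14]: a hash function returns one uniformly chosen coordinate. For binary vectors of dimension `dim`, two points at distance `t` collide with probability exactly `1 - t/dim`, so the family is monotonic. The function names and the test vectors are assumptions made for this example only.

```python
import random


def make_bit_sampling_hash(dim, rng):
    """Draw h uniformly from the bit-sampling family: h(x) = x[i] for a random coordinate i."""
    i = rng.randrange(dim)
    return lambda x: x[i]


def collision_probability(x, y):
    """Exact Pr[h(x) = h(y)] under bit sampling: the fraction of agreeing coordinates."""
    dist = sum(a != b for a, b in zip(x, y))
    return 1 - dist / len(x)


def estimate_collision_probability(x, y, trials, seed=0):
    """Empirical collision frequency over independently drawn hash functions."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_bit_sampling_hash(len(x), rng)
        hits += h(x) == h(y)
    return hits / trials


x = (0,) * 10
near = (1, 1) + (0,) * 8      # Hamming distance 2 from x
far = (1,) * 6 + (0,) * 4     # Hamming distance 6 from x
```

With $r = 2$ and $c = 3$ these two distances sit exactly at the thresholds, so on this toy space bit sampling behaves as a $(2, 6, 0.8, 0.4)$-sensitive family.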

LSH is able to break the $N^2$ barrier in cases where, for some constant
$c > 1$, the number of pairs in $R \bowtie_{\le cr} S$ is not too large. In
other words, there should not be too many pairs whose distance is within a
factor $c$ of the threshold, because such pairs are likely to become
candidates, yet considering them does not contribute to the output. For
notational simplicity, we will talk about far pairs at distance greater than
$cr$ (those that should not be reported), near pairs at distance at most $r$
(those that should be reported), and $c$-near pairs at distance between $r$
and $cr$ (those that should not be reported, but for which the LSH provides
no collision guarantees).

Our contribution. In this paper we study I/O-efficient similarity join
methods based on LSH. That is, we are interested in minimizing the number of
I/O operations, where a block of $B$ points from $U$ is transferred between an
external memory and an internal memory with capacity for $M$ points from $U$.
Our main result is the first cache-oblivious algorithm for similarity join
that has provably sub-quadratic dependency on the data size $N$ and at the
same time inverse polynomial dependency on $M$. In essence, where previous
methods have an overhead factor of either $N/M$ or $(N/B)^{\rho}$, we obtain
an overhead of $(N/M)^{\rho}$, where $0 < \rho < 1$ is a parameter of the LSH
employed, strictly improving both. We show:


Theorem 1. Consider $R, S \subseteq U$, let $N = |R| + |S|$, assume
$18 \log N + 3B \le M < N$ and that there exists a monotonic
$(r, cr, p_1, p_2)$-sensitive family of functions with respect to distance
measure $d$, using space $B$ and with $p_2 < p_1 < 1/2$. Let
$\rho = \log p_1 / \log p_2$. Then there exists a cache-oblivious randomized
algorithm computing $R \bowtie_{\le r} S$ (w.r.t. $d$) with probability
$1 - O(1/N)$ using

$$\tilde{O}\left( \left(\frac{N}{M}\right)^{\rho} \left( \frac{N}{B} + \frac{|R \bowtie_{\le r} S|}{MB} \right) + \frac{|R \bowtie_{\le cr} S|}{MB} \right)$$

I/Os.^1

^1 The $\tilde{O}(\cdot)$-notation hides polylog($N$) factors.
We conjecture that the bound in Theorem 1 is close to the best possible for the
class of “signature based” algorithms that work by generating a set of LSH values
(from a black-box and monotonic family) and checking all pairs that collide. Our
conjecture is based on an informal argument, given in full in Section 4. We
describe a worst-case input, where it seems significant advances are required to
beat Theorem 1 asymptotically. Further, we observe that for $M = N$ our bound
coincides with the optimal bound of reading the input, and when $M = 1$ our
bound coincides with the bounds of the best known internal memory algorithms.

It is worth noting that whereas most methods in the literature focus on a
single (or a few) distance measure, our method works for an arbitrary space
and distance measure that allows LSH, e.g., Hamming, Manhattan ($\ell_1$),
Euclidean ($\ell_2$), Jaccard, and angular metric distances. Since our
approach makes use of LSH as a black box, the problem of reporting the
complete join result with certainty would require major advances in LSH
methods (see [16,17] for recent progress in this direction).

A primary technical hurdle in the paper is that we cannot use any kind of
strong concentration bounds on the number of points having a particular value,
since hash values of an LSH family may be correlated by definition. Another
hurdle is duplicate elimination in the output stemming from pairs having
multiple LSH collisions. However, in the context of I/O-efficient algorithms
it is natural not to require that all near pairs be listed (written out);
rather, we simply require that the algorithm enumerates all near pairs. More
precisely, the algorithm calls a function emit($x$, $y$) for each near pair
$(x, y)$. This is a natural assumption in external memory since it reduces
the I/O complexity. In addition, it is desired in many applications where
join results are intermediate results pipelined to a subsequent computation,
and are not required to be stored on external memory. Our upper bound can be
easily adapted to list all instances by increasing the I/O complexity by an
unavoidable additive term of $O(|R \bowtie_{\le r} S|/B)$ I/Os.

Organization. The organization of the paper is as follows. In Section 2, 
we briefly review related work. Section 3 describes our algorithms including a 
warm-up cache-aware approach and the main results, a cache-oblivious solution, 
its analysis, and a randomized approach to remove duplicates. Section 4 provides 
some discussions on our algorithms with some real datasets. Section 5 concludes 
the paper. 

2 Related Work 

In this section, we briefly review LSH, the computational I/O model, and some 
state-of-the-art similarity join techniques. 

Locality-sensitive hashing (LSH). LSH was originally introduced by Indyk
and Motwani [14] for similarity search problems in high dimensional data.
This technique obtains a sublinear (i.e., $O(N^{\rho})$) time complexity by
increasing the gap of collision probability between near points and far
points using the LSH family as defined in Definition 1. The gap of collision
probability is polynomial, with an exponent of $\rho = \log p_1 / \log p_2$
dependent on $c$.
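To see how the exponent behaves, the short sketch below computes $\rho = \log p_1 / \log p_2$ and checks that concatenating $k$ independent hash functions, which turns the collision probabilities into $p_1^k$ and $p_2^k$, widens the gap between near and far pairs while leaving $\rho$ unchanged. The probabilities 0.8 and 0.4 are illustrative values chosen for this example, not figures from the paper.

```python
import math


def rho(p1, p2):
    """LSH exponent rho = log(p1) / log(p2) for collision probabilities p1 > p2."""
    return math.log(p1) / math.log(p2)


def concatenated(p, k):
    """Collision probability after concatenating k independent hash functions."""
    return p ** k


p1, p2, k = 0.8, 0.4, 5
gap_before = p1 / p2                                      # ratio near/far for one function
gap_after = concatenated(p1, k) / concatenated(p2, k)     # ratio after concatenation
exponent = rho(p1, p2)
exponent_after = rho(concatenated(p1, k), concatenated(p2, k))
```

Here `gap_after` equals `gap_before ** k`, which is the sense in which the gap grows polynomially in the far-pair collision probability.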
It is worth noting that the standard LSHs for metric distances, including
Hamming [14], $\ell_1$ [7], $\ell_2$ [1,7], Jaccard [4] and angular distances
[5], are monotonic. These common LSHs are space-efficient, and use space
comparable to that required to store a point, except the LSH of [1], which
requires more space. We do not explicitly require the hash values themselves
to be particularly small. However, using universal hashing we can always map
to small bit strings while introducing no new collisions with high
probability. Thus we assume that $B$ hash values fit in one memory block.

Computational I/O model. We study algorithms for similarity join in the
external memory model, which has been widely adopted in the literature (see, e.g.,
the survey by Vitter [18]). The external memory model consists of an internal
memory of $M$ words and an external memory of unbounded size. The processor
can only access data stored in the internal memory and move data between the
two memories in blocks of size $B$. For simplicity we will here measure block
and internal memory size in units of points from $U$, such that they can
contain $B$ points and $M$ points, respectively.

The I/O complexity of an algorithm is defined as the number of input/output
blocks moved between the two memories by the algorithm. The cache-aware
approach makes explicit use of the parameters $M$ and $B$ to achieve its I/O
complexity, whereas the cache-oblivious one [9] does not explicitly use any
model parameters. The latter approach is desirable as it implies optimality
on all levels of the memory hierarchy and does not require parameter tuning
when executed on different physical machines. Note that the cache-oblivious
model assumes that the internal memory is ideal in the sense that it has an
optimal cache-replacement policy. Such a cache-replacement policy can evict
the block that is used furthest in the future, and can place a block anywhere
in the cache (full associativity).

Similarity join techniques. We review some state-of-the-art similarity
join techniques most closely related to our work.

— Index-based similarity join. A popular approach is to make use of indexing
techniques to build a data structure for one relation, and then perform
queries using the points of the other relation. The indexes typically perform
some kind of filtering to reduce the number of points that a given query point
is compared to (see, e.g., [3,6,10]). Indexing can be space consuming, in
particular for LSH, but in the context of similarity join this is not a big
concern since we have many queries, and thus can afford to construct each
hash table “on the fly”. On the other hand, it is clear that index-based
similarity join techniques will not be able to take significant advantage of
internal memory when $N \gg M$. Indeed, the query complexity stated in [10]
is $O((N/B)^{\rho})$ I/Os. Thus the I/O complexity of using indexing for
similarity join will be high.

— Sorting-based. The indexing technique of [10] can be adapted to compute
similarity joins more efficiently by using the fact that many points are being
looked up in the hash tables. This means that all lookups can be done in
a batched fashion using sorting. This results in a dependency on $N$ that is
$\tilde{O}((N/B)^{1+\rho})$ I/Os, where $\rho \in (0, 1)$ is a parameter of
the LSH family.

— Generic joins. When $N$ is close to $M$ the I/O complexity can be improved
by using general join operators optimized for this case. It is easy to see
that when $N/M$ is an integer, a nested loop join requires $N^2/(MB)$ I/Os.
Our cache-oblivious algorithm will make use of the following result on
cache-oblivious nested loop joins:


Theorem 2. (He and Luo [12]) Given a similarity join condition, the join
of relations $R$ and $S$ can be computed by a cache-oblivious algorithm in

$$O\left( \frac{|R| + |S|}{B} + \frac{|R||S|}{MB} \right)$$

I/Os.


This number of I/Os suffices to generate the result in memory, but may not 
suffice to write it to disk. 

We note that a similarity join can be part of a multi-way join involving 
more than two relations. For the class of acyclic joins, where the variables 
compared in join conditions can be organized in a tree structure, one can 
initially apply a full reducer [20] that removes tuples that will not be part of 
the output. This efficiently reduces any acyclic join to a sequence of binary 
joins. Handling cyclic joins is much harder (see e.g. [15]) and outside the 
scope of this paper. 
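The $N^2/(MB)$ behaviour of nested loop joins mentioned above can be checked with a small I/O counter. The sketch below models a cache-aware blocked nested loop (not the cache-oblivious algorithm of He and Luo [12]): it keeps a chunk of $M/2$ points of $R$ in memory and scans $S$ once per chunk, counting block transfers only. The function name and the parameter values are assumptions made for this example.

```python
import math


def nested_loop_ios(n_r, n_s, M, B):
    """Block transfers of a blocked nested-loop join.

    A chunk of M // 2 points of R stays in memory while S is scanned once
    per chunk; every scan reads ceil(n_s / B) blocks.
    """
    chunk = M // 2
    r_chunks = math.ceil(n_r / chunk)
    read_r = math.ceil(n_r / B)            # R is read exactly once
    read_s = r_chunks * math.ceil(n_s / B)  # S is read once per chunk of R
    return read_r + read_s


n_r = n_s = 1024
M, B = 128, 16
ios = nested_loop_ios(n_r, n_s, M, B)
bound = (n_r + n_s) / B + 2 * n_r * n_s / (M * B)
```

For these values the count is 1088 block transfers, below the value 1152 of the expression $(|R|+|S|)/B + 2|R||S|/(MB)$, which has the same shape as the bound of Theorem 2.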


3 Our Algorithms 

In this section we describe our I/O-efficient algorithms. We start in Section 3.1
with a warm-up cache-aware algorithm. It uses an LSH family where the value
of the collision probability is set to be a function of the internal memory size.
Section 3.2 presents our main result, a recursive and cache-oblivious algorithm,
which uses the LSH with a black-box approach and does not make any assumption
on the value of the collision probability. Section 3.3 describes the analysis
and Section 3.4 shows how to reduce the expected number of times a near pair
is emitted.


3.1 Cache-aware algorithm: ASimJoin 

We will now describe a simple cache-aware algorithm called ASimJoin, which
achieves the worst case I/O bounds as stated in Theorem 1. ASimJoin relies on
an $(r, cr, p_1', p_2')$-sensitive family $\mathcal{H}'$ of hash functions
with the following properties: $p_2' \le M/N$ and $p_1' \ge (M/N)^{\rho}$,
for a suitable value $0 < \rho < 1$. Given an arbitrary monotonic
$(r, cr, p_1, p_2)$-sensitive family $\mathcal{H}$, the family $\mathcal{H}'$
can be built by concatenating $\lceil \log_{p_2}(M/N) \rceil$ hash functions
from $\mathcal{H}$. For simplicity, we assume that $\log_{p_2}(M/N)$ is an
integer and thus the probabilities $p_1'$ and $p_2'$ can be exactly obtained.
Nevertheless, the algorithm and its analysis can be extended
Algorithm ASimJoin(R, S): R, S are the input sets.
 1  Repeat 3 log(N) times
 2    Associate to each point in R and S a counter initially set to 0;
 3    Repeat L = 2/p_1' times
 4      Choose h' ∈ H' uniformly at random;
 5      Use h' to partition (in-place) R and S in buckets R_v, S_v of points
        with the hash value v;
 6      For each hash value v generated in the previous step
 7        /* For simplicity we assume that |R_v| ≤ |S_v| */
 8        Split R_v and S_v into chunks R_{i,v} and S_{i,v} of size at most M/2;
 9        For every chunk R_{i,v} of R_v
10          Load in memory R_{i,v};
11          For every chunk S_{i,v} of S_v do
12            Load in memory S_{i,v};
13            Compute R_{i,v} × S_{i,v} and emit all near pairs. For each far
              pair, increment the associated counters by 1;
14            Remove from S_{i,v} and R_{i,v} all points with the associated
              counter larger than 8LM, and write back to external memory;
15          Write R_{i,v} back to external memory;


to the general case by increasing the I/O complexity by a factor of at most
$1/p_1$ in the worst case; in practical scenarios, this factor is a small
constant [4,7,10].
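The construction of $\mathcal{H}'$ by concatenation can be sketched numerically. The helper below is hypothetical (its name and the sample values of $N$, $M$, $p_1$, $p_2$ are assumptions for illustration): it computes the number $k$ of concatenated functions, the resulting collision probabilities $p_1' = p_1^k$ and $p_2' = p_2^k$, and the number of repetitions $L = 2/p_1'$ used by ASimJoin.

```python
import math


def asimjoin_parameters(N, M, p1, p2):
    """Parameters of the concatenated family H' for data size N and memory size M."""
    k = math.ceil(math.log(M / N) / math.log(p2))  # number of concatenated functions
    p1_prime = p1 ** k                              # near-pair collision probability
    p2_prime = p2 ** k                              # far-pair collision probability
    L = math.ceil(2 / p1_prime)                     # repetitions in Step 3
    return k, p1_prime, p2_prime, L


N, M = 2 ** 20, 2 ** 10
p1, p2 = 1 / 4, 1 / 16                 # illustrative values with rho = 1/2
rho = math.log(p1) / math.log(p2)
k, p1_prime, p2_prime, L = asimjoin_parameters(N, M, p1, p2)
```

In this example $\log_{p_2}(M/N) = 2.5$ is not an integer, so $k$ is rounded up to 3. The far-pair probability then satisfies $p_2' \le M/N$, while $p_1'$ falls below $(M/N)^{\rho}$ by less than a factor $1/p_1$, which is the source of the extra factor discussed above.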

ASimJoin assumes that each point in $R$ and $S$ is associated with a counter
initially set to 0. This counter can be thought of as an additional dimension
of the point which hash functions and comparisons do not take into account.
The algorithm repeats the following procedure $L = 2/p_1'$ times. A hash
function is randomly drawn from the $(r, cr, p_1', p_2')$-sensitive family,
and it is used for partitioning the sets $R$ and $S$ into buckets of points
with the same hash value. We let $R_v$ and $S_v$ denote the buckets
respectively containing points of $R$ and $S$ with the same hash value $v$.
Then, the algorithm iterates through every hash value and, for each hash
value $v$, it uses a double nested loop for generating all pairs of points in
$R_v \times S_v$. The double nested loop loads consecutive chunks of $R_v$
and $S_v$ of size at most $M/2$: the outer loop runs on the smaller set (say
$R_v$), while the inner one runs on the larger one (say $S_v$). For each pair
$(x, y)$, the algorithm emits the pair if $d(x, y) \le r$, increases by 1 the
counters associated with $x$ and $y$ if $d(x, y) > cr$, or ignores the pair
if $r < d(x, y) \le cr$. Every time the counter of a point exceeds $8LM$, the
point is considered to be far away from all points and is removed from the
bucket. Chunks are written back to external memory when they are no longer
needed. The entire ASimJoin algorithm is repeated $3 \log N$ times to find all
near pairs with high probability. The following theorem shows the I/O bounds
of the cache-aware approach.
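The control flow described above can be sketched as an in-memory simulation. This is not an external-memory implementation: chunking and block transfers are omitted, and points whose counter exceeds $8LM$ are dropped at the end of a round rather than immediately. The hash family (concatenated bit sampling), the function names and the toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict


def hamming(x, y):
    """Hamming distance between two equal-length tuples."""
    return sum(a != b for a, b in zip(x, y))


def draw_concatenated_bit_sampling(dim, k):
    """Hypothetical family H': the concatenation of k randomly sampled coordinates."""
    def draw(rng):
        coords = [rng.randrange(dim) for _ in range(k)]
        return lambda x: tuple(x[i] for i in coords)
    return draw


def asimjoin(R, S, d, r, c, draw_hash, L, M, repeats, emit, seed=0):
    rng = random.Random(seed)
    threshold = 8 * L * M
    for _ in range(repeats):                               # Step 1
        counter = defaultdict(int)                         # Step 2: one counter per point
        removed = set()
        for _ in range(L):                                 # Step 3
            h = draw_hash(rng)                             # Step 4
            buckets_r, buckets_s = defaultdict(list), defaultdict(list)
            for i, x in enumerate(R):                      # Step 5: partition by hash value
                if ("R", i) not in removed:
                    buckets_r[h(x)].append(i)
            for j, y in enumerate(S):
                if ("S", j) not in removed:
                    buckets_s[h(y)].append(j)
            for v in buckets_r.keys() & buckets_s.keys():  # Step 6
                for i in buckets_r[v]:                     # Steps 9-13, without chunking
                    for j in buckets_s[v]:
                        dist = d(R[i], S[j])
                        if dist <= r:
                            emit(R[i], S[j])               # near pair
                        elif dist > c * r:                 # far pair
                            counter["R", i] += 1
                            counter["S", j] += 1
            # Step 14 (simplified): drop points that collided with too many far points.
            removed |= {p for p, n in counter.items() if n > threshold}


R = [(0,) * 8, (1,) * 8]
S = [(0,) * 8, (0, 0, 0, 0, 0, 0, 0, 1), (1, 1, 1, 1, 0, 0, 0, 0)]
found = set()
asimjoin(R, S, hamming, r=1, c=2, draw_hash=draw_concatenated_bit_sampling(8, 2),
         L=4, M=4, repeats=3, emit=lambda x, y: found.add((x, y)))
```

Collecting emitted pairs in a set hides the fact that a near pair may be emitted several times; identical points, for example, collide under every hash function and are emitted in every round.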

Theorem 3. Consider $R, S \subseteq U$ and let $N = |R| + |S|$ be sufficiently
large. Assume there exists a monotonic $(r, cr, p_1', p_2')$-sensitive family
of functions with respect to distance measure $d$ with $p_1' = (M/N)^{\rho}$
and $p_2' = M/N$, for a suitable value $0 < \rho < 1$. With probability
$1 - 1/N$, the ASimJoin algorithm enumerates all near pairs using

$$\tilde{O}\left( \left(\frac{N}{M}\right)^{\rho} \left( \frac{N}{B} + \frac{|R \bowtie_{\le r} S|}{MB} \right) + \frac{|R \bowtie_{\le cr} S|}{MB} \right)$$

I/Os.

Proof. We observe that the I/O cost of Steps 4-5, that is, of partitioning
sets $R$ and $S$ according to a hash function $h'$, is
$L \cdot \mathrm{sort}(N) = \tilde{O}\left((N/M)^{\rho} \cdot N/B\right)$ for
$L \le 2(N/M)^{\rho}$ repetitions.^2

We now consider the I/O cost of an iteration of the loop in Step 6 for a given
hash value $v$. When the size of one bucket (say $R_v$) is smaller than $M/2$,
we are able to load the whole $R_v$ into the internal memory and then load
consecutive blocks of $S_v$ to execute join operations. Hence, the I/O cost of
this step is at most $(|R_v| + |S_v|)/B$. The total I/O cost of the
$3L \log(N)$ iterations of Step 6 among all possible hash values where at
least one bucket has size smaller than $M/2$ is at most
$\tilde{O}(L \cdot N/B) = \tilde{O}\left((N/M)^{\rho} \cdot N/B\right)$ I/Os.

The I/O cost of Step 6 when both buckets $R_v$ and $S_v$ are larger than $M/2$
is $2|R_v||S_v|/(BM)$. This means that the amortized cost of each pair in
$R_v \times S_v$ is $2/(BM)$. Therefore the amortized I/O cost of all
iterations of Step 6, when no bucket has size less than $M/2$, can be upper
bounded by multiplying the total number of generated pairs by $2/(BM)$. Based
on this observation, we classify and enumerate generated pairs into three
groups: near pairs, $c$-near pairs and far pairs. We denote by $C_n$, $C_{cn}$
and $C_f$ the respective size of each group, and upper bound these quantities
to derive the proof.

1. Number of near pairs. By definition, LSH gives a lower bound on the
probability of collision of near pairs. It may happen that the collision
probability of near pairs is 1. Thus, two near points might collide in all $L$
repetitions of Step 3 and in all $3\log(N)$ repetitions of Step 1. This means
that $C_n \le 3L \log(N) |R \bowtie_{\le r} S|$. Note that this bound is a
deterministic worst case bound.

2. Number of $c$-near pairs. Any $c$-near pair from $R \bowtie_{\le cr} S$
appears in a bucket with probability at most $p_1'$ due to monotonicity of our
LSH family. Since we have $L = 2/p_1'$ repetitions, each $c$-near pair
collides at most 2 times in expectation. In other words, the expected number
of $c$-near pair collisions among $L$ repetitions is at most
$2|R \bowtie_{\le cr} S|$. By using the Chernoff bound [8, Exercise 1.1] with
$3\log(N)$ independent $L$ repetitions (in Step 1), we have

$$\Pr\left[C_{cn} \ge 6 \log(N) |R \bowtie_{\le cr} S|\right] \le 1/N^2,$$

that is,

$$\Pr\left[C_{cn} \le 6 \log(N) |R \bowtie_{\le cr} S|\right] \ge 1 - 1/N^2.$$

3. Number of far pairs. If $x \in R \cup S$ is far away from all points, the
expected number of collisions of $x$ in the $L$ hash tables (including
duplicates) is at most $8LM$, since then the point is removed by Step 14.
Hence the total number of examined far pairs is $C_f \le 24 N L M \log(N)$.

^2 We let $\mathrm{sort}(N) = O\left((N/B) \log_{M/B}(N/B)\right)$ be
shorthand for the I/O complexity [18] of sorting $N$ points.

Therefore, by summing the number of near pairs $C_n$, $c$-near pairs $C_{cn}$
and far pairs $C_f$, and multiplying these quantities by the amortized I/O
complexity $2/(BM)$, we upper bound the I/O cost of all iterations of Step 6,
when there are no buckets of size less than $M/2$, by

$$\tilde{O}\left( \left(\frac{N}{M}\right)^{\rho} \left( \frac{N}{B} + \frac{|R \bowtie_{\le r} S|}{BM} \right) + \frac{|R \bowtie_{\le cr} S|}{BM} \right)$$

with probability at least $1 - 1/N^2$. By summing all the previous bounds, we
get the claimed I/O bound with high probability.

We now analyze the probability to enumerate all near pairs. Consider one
iteration of Step 1. A near pair is not emitted if at least one of the
following events happens:

1. The two points do not collide in the same bucket in any of the $L$
iterations of Step 3. This happens with probability at most
$(1 - p_1')^{L} = (1 - p_1')^{2/p_1'} < 1/e^2$.

2. One of the two points is removed by Step 14 because it collides with more
than $8LM$ far points. There are at most $N$ far points and each collides with
a given point $x$ with probability at most $p_2' = M/N$ per iteration, so the
expected number of far collisions of $x$ over the $L$ iterations is at most
$LM$. By Markov's inequality, the probability that $x$ collides with at least
$8LM$ far points in the $L$ iterations is at most 1/8. Then, this event
happens with probability at most 1/4.

Therefore, a near pair is not emitted in one iteration of Step 1 with
probability at most $1/e^2 + 1/4 < 1/2$ and is never emitted in the $3\log N$
iterations with probability at most $(1/2)^{3 \log N} = 1/N^3$. Then, by a
union bound, it follows that all near pairs (there are at most $N^2$ of them)
are emitted with probability at least $1 - 1/N$ and the theorem follows. □

As already mentioned in the introduction, a near pair $(x, y)$ can be emitted
many times during the algorithm since points $x$ and $y$ can be hashed on the
same value in $p(x,y)L$ rounds of Step 3, where $p(x,y) \ge p_1'$ denotes the
actual collision probability. A simple approach for avoiding duplicates is the
following: for each near pair found during the $i$-th iteration of Step 3, the
pair is emitted only if the two points did not collide under any of the hash
functions used in the previous $i - 1$ rounds. The check starts from the hash
function used in the previous round and backtracks until a collision is found
or there are no more hash functions. This approach increases the worst case
complexity by a factor $L$. Section 3.4 shows a more efficient randomized
algorithm that reduces the number of replicas per near pair to a constant.
This technique also applies to the cache-oblivious algorithm described in the
next section.
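The simple backtracking check for duplicates can be sketched as follows, assuming the hash functions of all rounds so far are kept in a list `hashes` (an assumption of this example; the paper does not prescribe a data layout). A pair found in round `i` is emitted only if it did not already collide in an earlier round.

```python
def should_emit(x, y, hashes, i):
    """Return True if round i is the first round in which x and y collide.

    The scan starts from round i - 1 and moves backwards, stopping at the
    first earlier collision, as in the simple approach described above.
    """
    if hashes[i](x) != hashes[i](y):
        return False                      # the pair is not generated in round i
    for j in range(i - 1, -1, -1):
        if hashes[j](x) == hashes[j](y):
            return False                  # already emitted in an earlier round
    return True


# Three illustrative rounds of bit sampling on coordinates 0, 1 and 2.
hashes = [lambda p: p[0], lambda p: p[1], lambda p: p[2]]
x, y = (0, 1, 1), (1, 1, 1)               # collide in rounds 1 and 2 only
emitted_rounds = [i for i in range(len(hashes)) if should_emit(x, y, hashes, i)]
```

Each check may evaluate up to `i` earlier hash functions, which is where the extra factor $L$ in the worst case comes from.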



Algorithm OSimJoin(R, S, ψ): R, S are the input sets, and ψ is the
recursion depth.
1  If |R| > |S|, then swap (the references to) the sets such that |R| ≤ |S|;
2  If ψ = Ψ or |R| ≤ 1, then compute R ⋈_{≤r} S using the algorithm of
   Theorem 2 and return;
3  Pick a random sample S' of 18Δ points from S (or all points if |S| < 18Δ);
4  Compute R' containing all points of R that have distance smaller than cr to
   at least half of the points in S';
5  Compute R' ⋈_{≤r} S using the algorithm of Theorem 2;
6  Repeat L = 1/p_1 times
7    Choose h ∈ H uniformly at random;
8    Use h to partition (in-place) R\R' and S in buckets R_v, S_v of points
     with hash value v;
9    For each v where R_v and S_v are nonempty, recursively call
     OSimJoin(R_v, S_v, ψ + 1);


3.2 Cache-oblivious algorithm: OSimJoin 

The above cache-aware algorithm uses an $(r, cr, p_1', p_2')$-sensitive family
of functions $\mathcal{H}'$, with $p_1' \approx (M/N)^{\rho}$ and
$p_2' \approx M/N$, for partitioning the initial sets into smaller buckets,
which are then efficiently processed in the internal memory using the nested
loop algorithm. If we know the internal memory size $M$, this LSH family can
be constructed by concatenating $\lceil \log_{p_2}(M/N) \rceil$ hash functions
from any given primitive $(r, cr, p_1, p_2)$-sensitive family $\mathcal{H}$.
Without knowing $M$ in the cache-oblivious setting, such a family cannot be
built. Therefore, we propose OSimJoin, a cache-oblivious algorithm that
efficiently computes the similarity join without knowing the internal memory
size $M$ and the block length $B$.

OSimJoin uses as a black-box a given monotonic $(r, cr, p_1, p_2)$-sensitive
family $\mathcal{H}$.^3 The values of $p_1$ and $p_2$ can be considered
constant in a practical scenario. As is common in cache-oblivious settings, we
use a recursive approach for splitting the problem into smaller and smaller
subproblems that at some point will fit the internal memory, although this
point is not known in the algorithm. We first give a high level description of
the cache-oblivious algorithm and an intuitive explanation. We then provide a
more detailed description and analysis.

OSimJoin receives in input the two sets $R$ and $S$ of the similarity join,
and a parameter $\psi$ denoting the depth in the recursion tree (initially,
$\psi = 0$) that is used for recognizing the base case. Let $|R| \le |S|$,
$N = |R| + |S|$, and denote with $\Delta = \log N$ and
$\Psi = \lceil \log_{1/p_2} N \rceil$ two global values that are kept
invariant in the recursive levels and computed using the initial input size
$N$. For simplicity we assume that $1/p_1$ and $1/p_2$ are integers, and
further assume without loss of generality that the initial size $N$ is a power
of two. Note that, if $1/p_1$ is not an integer, the last iteration in Step 6
can be performed with a random variable
$L \in \{\lfloor 1/p_1 \rfloor, \lceil 1/p_1 \rceil\}$ such that
$E[L] = 1/p_1$.

^3 The monotonicity requirement can be relaxed to the following:
$\Pr[h(x) = h(y)] \ge \Pr[h(x') = h(y')]$ for every two pairs $(x, y)$ and
$(x', y')$ where $d(x, y) \le r$ and $d(x', y') > r$. A monotonic LSH family
clearly satisfies this assumption.
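One way to realize a random number of repetitions with $E[L] = 1/p_1$ is randomized rounding: take $\lfloor 1/p_1 \rfloor$ repetitions and add one more with probability equal to the fractional part of $1/p_1$. The sketch below is an illustration of this idea; the function name and the sample value $p_1 = 0.4$ are assumptions for the example.

```python
import math
import random


def sample_repetitions(p1, rng):
    """Return floor(1/p1) or ceil(1/p1) so that the expected value is 1/p1."""
    target = 1 / p1
    low = math.floor(target)
    frac = target - low                  # probability of the extra repetition
    return low + (rng.random() < frac)


rng = random.Random(1)
samples = [sample_repetitions(0.4, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

For $p_1 = 0.4$ the samples take the values 2 and 3 and their mean is close to 2.5; when $1/p_1$ is an integer the function always returns that integer.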

OSimJoin works as follows. If the problem is currently at the recursive level
$\psi = \Psi = \lceil \log_{1/p_2} N \rceil$ or $|R| \le 1$, the recursion
ends and the problem is solved using the cache-oblivious nested loop described
in Theorem 2. Otherwise, the following operations are executed. By exploiting
sampling, the algorithm identifies a subset $R'$ of $R$ containing (almost)
all points that are near or $c$-near to a constant fraction of points in $S$
(Steps 3-4). Then we compute $R' \bowtie_{\le r} S$ using the cache-oblivious
nested loop of Theorem 2 and remove points in $R'$ from $R$ (Step 5).
Subsequently, the algorithm repeats the following operations $L = 1/p_1$
times: a hash function is extracted from the $(r, cr, p_1, p_2)$-sensitive
family and used for partitioning $R$ and $S$ into buckets, denoted with $R_v$
and $S_v$ for any hash value $v$ (Steps 7-8); then, the join
$R_v \bowtie_{\le r} S_v$ is computed recursively by OSimJoin (Step 9).

The explanation of our approach is the following. By recursively partitioning
input points with hash functions from $\mathcal{H}$, the algorithm decreases
the probability of collision between two far points. In particular, the
collision probability of a far pair is $p_2^i$ at the $i$-th recursive level.
On the other hand, by repeating the partitioning $1/p_1$ times in each level,
the algorithm guarantees that a near pair is enumerated with constant
probability, since the probability that a near pair collides is $p_1^i$ at the
$i$-th recursive level. It is worth noting that the collision probability of
far and near pairs at the recursive level $\log_{1/p_2}(N/M)$ is
$\Theta(M/N)$ and $\Theta((M/N)^{\rho})$, respectively, which are
asymptotically equivalent to the values in the cache-aware algorithm. In other
words, the partitioning of points at this level is equivalent to the one in
the cache-aware algorithm with collision probability for a far pair
$p_2' = M/N$. Finally, we observe that, when a point in $R$ is close to many
points in $S$, it is more efficient to detect and remove it, instead of
propagating it down to the base cases. This is due to the fact that the
collision probability of very near pairs is always large (close to 1) and the
algorithm is not able to split them into subproblems that fit in memory.
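The recursive structure of OSimJoin can be sketched as an in-memory simulation. The code below is an illustration under stated assumptions, not the paper's implementation: an exhaustive nested loop stands in for the cache-oblivious join of Theorem 2, the hash family is single-coordinate bit sampling, the distance is assumed symmetric, and all names and parameter values (`delta`, `max_depth`, the toy data) are chosen for the example.

```python
import random
from collections import defaultdict


def hamming(x, y):
    """Hamming distance between two equal-length tuples."""
    return sum(a != b for a, b in zip(x, y))


def draw_bit_sampling(dim):
    """Hypothetical black-box family H: return one random coordinate."""
    def draw(rng):
        i = rng.randrange(dim)
        return lambda x: x[i]
    return draw


def nested_loop(R, S, d, r, emit):
    """Stand-in for the join of Theorem 2: compare all pairs."""
    for x in R:
        for y in S:
            if d(x, y) <= r:
                emit(x, y)


def osimjoin(R, S, psi, d, r, c, draw_hash, L, delta, max_depth, emit, rng):
    if len(R) > len(S):                                     # Step 1
        R, S = S, R
        emit = (lambda e: lambda a, b: e(b, a))(emit)       # keep the caller's pair order
    if psi == max_depth or len(R) <= 1:                     # Step 2: base case
        nested_loop(R, S, d, r, emit)
        return
    k = 18 * delta
    sample = S if len(S) < k else rng.sample(S, k)          # Step 3
    dense_flags = [2 * sum(d(x, y) < c * r for y in sample) >= len(sample)
                   for x in R]                              # Step 4
    dense = [x for x, f in zip(R, dense_flags) if f]
    rest = [x for x, f in zip(R, dense_flags) if not f]
    nested_loop(dense, S, d, r, emit)                       # Step 5
    for _ in range(L):                                      # Step 6
        h = draw_hash(rng)                                  # Step 7
        buckets_r, buckets_s = defaultdict(list), defaultdict(list)
        for x in rest:                                      # Step 8
            buckets_r[h(x)].append(x)
        for y in S:
            buckets_s[h(y)].append(y)
        for v in buckets_r.keys() & buckets_s.keys():       # Step 9
            osimjoin(buckets_r[v], buckets_s[v], psi + 1, d, r, c,
                     draw_hash, L, delta, max_depth, emit, rng)


R = [(0,) * 8, (1,) * 8]
S = [(0,) * 8, (0, 0, 0, 0, 0, 0, 0, 1), (1, 1, 1, 1, 0, 0, 0, 0)]
found = set()
osimjoin(R, S, 0, hamming, r=1, c=2, draw_hash=draw_bit_sampling(8), L=2,
         delta=3, max_depth=3, emit=lambda x, y: found.add((x, y)),
         rng=random.Random(0))
```

In this toy input the all-zero point of `R` is within distance $cr$ of two of the three points of `S`, so it is placed in $R'$ and joined directly in Step 5; only the all-one point is passed down the recursion.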

3.3 I/O Complexity and Correctness of OSimJoin 

Analysis of I/O Complexity. We will bound the expected number of I/Os
of the algorithm rather than the worst case. This can be converted to a high
probability bound by running $\log N$ parallel instances of our algorithm
(without loss of generality we assume that the optimal cache replacement
splits the cache into parts of size $M/\log N$, one assigned to each
instance). The total execution stops when the first parallel instance
terminates, which with probability at least $1 - 1/N$ is within a logarithmic
factor of the expected I/O bound (logarithmic factors are absorbed in the
$\tilde{O}$-notation).

For notational simplicity, in this section we let $R$ and $S$ denote the
initial input sets and let $\tilde{R}$ and $\tilde{S}$ denote the subsets
given in input to a particular recursive subproblem (note that, due to Step 1,
$\tilde{R}$ can denote a subset of $R$ but also of $S$; similarly for
$\tilde{S}$). We also let $\tilde{S}'$ denote the sample of $\tilde{S}$ taken
in Step 3, and $\tilde{R}'$ the subset of $\tilde{R}$ computed in Step 4.
Lemma 1 says that two properties of the choice of random sample in Step 3 are
almost certain, and the proof relies on Chernoff bounds on the choice of
$\tilde{S}'$. In the remainder of the paper, we assume that Lemma 1 holds and
refer to this event as $\mathcal{A}$, which holds with probability
$1 - O(1/N)$.


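Lemma 1. With probability at least $1 - O(1/N)$ over the random choices in
Step 3, the following bounds hold for every subproblem
OSimJoin($\tilde{R}$, $\tilde{S}$, $\psi$):

$$|\tilde{R}' \bowtie_{\le cr} \tilde{S}| > \frac{|\tilde{R}'||\tilde{S}|}{6} \qquad (1)$$

$$|(\tilde{R} \setminus \tilde{R}') \bowtie_{> cr} \tilde{S}| > \frac{|\tilde{R} \setminus \tilde{R}'||\tilde{S}|}{6} \qquad (2)$$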


Proof. Let $x \in \tilde{R}$ be a point which is $c$-near to at most one sixth
of the points in $\tilde{S}$, i.e.,
$|x \bowtie_{\le cr} \tilde{S}| \le |\tilde{S}|/6$. The point $x$ enters
$\tilde{R}'$ if there are at least $9\Delta$ $c$-near points in $\tilde{S}'$,
and this happens, by a Chernoff bound [8, Theorem 1.1], with probability at
most $1/N^4$. Each point of $R \cup S$ appears in at most
$2L^{\Psi} \le 2LN \le 2N^2$ subproblems and there are at most $N$ points in
$R \cup S$. Therefore, with probability $1 - 2N^3 N^{-4} = 1 - 2N^{-1}$, we
have that in every subproblem OSimJoin($\tilde{R}$, $\tilde{S}$, $\psi$) no
point with at most $|\tilde{S}|/6$ $c$-near points in $\tilde{S}$ is in
$\tilde{R}'$. Hence each point in $\tilde{R}'$ has at least $|\tilde{S}|/6$
$c$-near points in $\tilde{S}$, and the bound in Equation 1 follows.

We can similarly show that, with probability
$1 - 2N^3 N^{-4} = 1 - 2N^{-1}$, we have that in every subproblem
OSimJoin($\tilde{R}$, $\tilde{S}$, $\psi$) all points with at least
$5|\tilde{S}|/6$ $c$-near points in $\tilde{S}$ are in $\tilde{R}'$. Then,
each point in $\tilde{R} \setminus \tilde{R}'$ has $|\tilde{S}|/6$ far points
in $\tilde{S}$ and Equation 2 follows. □


To analyze the number of I/Os for subproblems of size more than $M$ we
bound the cost in terms of different types of collisions of pairs in
$R \times S$ that end up in the same subproblem of the recursion. We say that
$(x, y)$ is in a particular subproblem
OSimJoin($\tilde{R}$, $\tilde{S}$, $\psi$) if
$(x, y) \in (\tilde{R} \times \tilde{S}) \cup (\tilde{S} \times \tilde{R})$.
Observe that a pair $(x, y)$ is in a subproblem if and only if $x$ and $y$
have colliding hash values on every step of the call path from the initial
invocation of OSimJoin.


Definition 2. Given $Q \subseteq R \times S$, let $C_i(Q)$ be the number of
times a pair in $Q$ is in a call to OSimJoin at the $i$-th level of recursion.
We also let $C_{i,k}(Q)$, with $0 \le k \le \log M$, denote the number of
times a pair in $Q$ is in a call to OSimJoin at the $i$-th level of recursion
where the smallest input set has size in $[2^k, 2^{k+1})$ if
$0 \le k < \log M$, and in $[M, +\infty)$ if $k = \log M$. The count is over
all pairs and with multiplicity, so if $(x, y)$ is in several subproblems at
the $i$-th level, all these are counted.


Next we bound the I/O complexity of OSimJoin in terms of
$C_i(R \bowtie_{\le cr} S)$ and $C_{i,k}(R \bowtie_{> cr} S)$, for any
$0 \le i < \Psi$. We will later upper bound the expected size of these
quantities in Lemma 3 and then get the claim of Theorem 1.

Lemma 2. Let $\ell = \lceil \log_{1/p_2}(N/M) \rceil$ and
$M \ge 18 \log N + 3B$. Given that $\mathcal{A}$ holds, the I/O complexity of
OSimJoin($R$, $S$, 0) is

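$$\tilde{O}\left( \frac{N L^{\ell}}{B} + \sum_{i=0}^{\ell} \frac{C_i(R \bowtie_{\le cr} S)}{MB} + \sum_{i=\ell}^{\Psi-1} \sum_{k=0}^{\log M} \frac{C_{i,k}(R \bowtie_{> cr} S)}{B 2^k} \right)$$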


Proof. To ease the analysis we assume that no more than 1/3 of internal memory
is used to store blocks containing elements of $\tilde{R}$ and $\tilde{S}$,
respectively. Since the cache-oblivious model assumes an optimal cache
replacement policy, this cannot decrease the I/O complexity. Also, internal
memory space used for other things than data (input and output buffers, the
recursion stack of size at most $\Psi$) is less than $M/3$ by our assumption
that $M \ge 18 \log N + 3B$. As a consequence, we have that the number of I/Os
for solving a subproblem OSimJoin($\tilde{R}$, $\tilde{S}$, $\cdot$) where
$|\tilde{R}| \le M/3$ and $|\tilde{S}| \le M/3$ is
$O\left((|\tilde{R}| + |\tilde{S}|)/B\right)$, including all recursive calls.
This is because there is space $M/3$ dedicated to each input set and only I/Os
for reading the input are required. By charging the cost of such subproblems
to the writing of the inputs in the parent problem, we can focus on
subproblems where the largest set (i.e., $\tilde{S}$) has size more than
$M/3$. We notice that the cost of Steps 3-4 is dominated by other costs by our
assumption that the set $\tilde{S}'$ fits in internal memory, which implies
that it suffices to scan the data once to implement these steps. This cost is
clearly negligible with respect to the remaining steps and thus we ignore it.

We first provide an upper bound on the I/O complexity required by all
subproblems at a recursive level above $\ell$. Let OSimJoin$(R, S, i)$ be a
recursive call at the $i$-th recursive level, for $0 \le i \le \ell$. The I/O
cost of the nested loop join in Step 5 in OSimJoin$(R, S, i)$ is
$O(|S|/B + |R'||S|/(MB))$ by Theorem 2. We can ignore the $O(|S|/B)$ term since
it is asymptotically negligible with respect to the cost of each iteration of
Step 6, which is upper bounded later. By Equation 1, we have that
$R' \bowtie_{\le cr} S$ contains more than $|R'||S|/6$ pairs, and thus the cost
of Step 5 in OSimJoin$(R, S, i)$ is $O(|R' \bowtie_{\le cr} S|/(MB))$. This
means that we can bound the total I/O cost of all executions of Step 5 at level
$i$ of the recursion with $O(C_i(R \bowtie_{\le cr} S)/(MB))$ since each near
pair $(x, y)$ appears in $C_i((x, y))$ subproblems at level $i$.

The second major part of the I/O complexity is the cost of preparing recursive
calls in OSimJoin$(R, S, i)$ (i.e., Steps 7-8). In fact, in each iteration of
Step 6, the I/O cost is $O((|R| + |S|)/B)$, which includes the cost of hashing
and of sorting to form buckets. Since each point of $R$ and $S$ is replicated in
$L$ subproblems in Step 6, we have that each point of the initial sets $R$ and
$S$ is replicated $L^i$ times at level $i$. Since the average cost per entry is
$O(1/B)$, we have that the total cost for preparing recursive calls at level $i$
is $O(N L^{i+1}/B)$. By summing the above terms, we have that the total I/O
complexity of all subproblems in the $i$-th recursive level is upper bounded by:

$$O\left( \frac{C_i(R \bowtie_{\le cr} S)}{MB} + \frac{N L^{i+1}}{B} \right).
\qquad (3)$$

We now focus our analysis to bound the I/O complexity required by all
subproblems at a recursive level below $\ell$. Let again OSimJoin$(R, S, i)$ be
a recursive call at the $i$-th recursive level, for $\ell \le i < \Psi$. We
observe that (part of) the cost of a subproblem at level $i \ge \ell$ can be
upper bounded by a suitable function of collisions among far points in
OSimJoin$(R, S, i)$. More specifically, consider an iteration of Step 6 in a
subproblem at level $i$. Then, the cost for preparing the recursive calls and
for performing Step 5 in each subproblem (at level $i + 1$) generated during the
iteration, can be upper bounded as

$$O\left( \frac{|R \setminus R'| + |S|}{B}
  + \frac{|(R \setminus R') \bowtie_{\le cr} S|}{BM} \right),$$

since each near pair in $(R \setminus R') \bowtie_{\le cr} S$ is found in Step 5
in at most one subproblem at level $i + 1$ generated during the iteration. Since
we have that $|(R \setminus R') \bowtie_{\le cr} S| \le |R \setminus R'||S|$, we
easily get that the above bound can be rewritten as
$O(|R \setminus R'||S| / (B \min\{M, |R \setminus R'|\}))$. We observe that this
bound holds even when $i = \Psi - 1$: in this case the cost includes all I/Os
required for solving the subproblems at level $\Psi$ called in the iteration and
which are solved using the nested loop in Theorem 2 (see Step 2). By Lemma 1, we
have that the above quantity can be upper bounded with the number of far
collisions between $R$ and $S$, getting
$O(|(R \setminus R') \bowtie_{>cr} S| / (B \min\{M, |R \setminus R'|\}))$.

Recall that $C_{i,k}(Q)$ denotes the number of times a pair in $Q$ is in a call
to OSimJoin at the $i$-th level of recursion where the smallest input set has
size in $[2^k, 2^{k+1})$ if $0 \le k < \log M$, and in $[M, +\infty)$ if
$k = \log M$. Then, the total cost for preparing the recursive calls in Steps
7-8 in all subproblems at level $i$ and for performing Step 5 in all subproblems
at level $i + 1$ is:†

$$O\left( \sum_{k=0}^{\log M}
    \frac{C_{i,k}(R \bowtie_{>cr} S)\, L}{B 2^k} \right). \qquad (4)$$

The L factor in the above bound follows since far collisions at level i are used 
for amortizing the cost of Step 5 for each one of the L iterations of Step 6. 

To get the total I/O complexity of the algorithm we sum the I/O complexity
required by each recursive level. We bound the cost of each level as follows:
for a level $i < \ell$ we use the bound in Equation 3; for a level $i > \ell$ we
use the bound in Equation 4; for level $i = \ell$, we use the bound given in
Equation 4 to which we add the first term in Equation 3 since the cost of Step 5
at level $\ell$ is not included in Equation 4 (note that the addition of
Equations 3 and 4 gives a weak upper bound for level $\ell$). The lemma
follows. □

† We note that the true input size of a subproblem is $|R|$ and not
$|R \setminus R'|$. However, the expected value of $C_{i,k}(R \bowtie_{>cr} S)$
is computed assuming the worst case where there are no close pairs and thus
$R' = \emptyset$.

We will now analyze the expected sizes of the terms in Lemma 2. Clearly each
pair from $R \times S$ is in the top level call, so the number of collisions is
$|R||S| \le N^2$. But in lower levels we show that the expected number of times
that a pair collides either decreases or increases geometrically, depending on
whether the collision probability is smaller or larger than $p_1$ (or
equivalently, depending on whether the distance is greater or smaller than the
radius $r$). The lemma follows by expressing the number of collisions of the
pairs at the $i$-th recursive level as a Galton-Watson branching process [11].

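Lemma 3. Given that $\Lambda$ holds, for each $0 \le i \le \Psi$ we have

1. $E[C_i(R \bowtie_{>cr} S)] \le |R \bowtie_{>cr} S|\, (p_2/p_1)^i$;
2. $E[C_i(R \bowtie_{>r, \le cr} S)] \le |R \bowtie_{>r, \le cr} S|$;
3. $E[C_i(R \bowtie_{\le r} S)] \le |R \bowtie_{\le r} S|\, L^i$;
4. $E[C_{i,k}(R \bowtie_{>cr} S)] \le N 2^{k+1} (p_2/p_1)^i$, for any
   $0 \le k < \log M$.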


Proof. Let $x \in R$ and $y \in S$. We are interested in upper bounding the
number of collisions of the pair at the $i$-th recursive level. We envision the
problem as a branching process (more specifically a Galton-Watson process, see
e.g. [11]) where the expected number of children (i.e., recursive calls that
preserve a particular collision) is $\Pr[h(x) = h(y)]/p_1$ for random
$h \in \mathcal{H}$. It is a standard fact from this theory that the expected
population size at generation $i$ (i.e., number of times $(x, y)$ is in a
problem at recursive level $i$) is $(\Pr[h(x) = h(y)]/p_1)^i$ [11, Theorem 5.1].
If $d(x, y) > cr$, we have that $\Pr[h(x) = h(y)] \le p_2$ and each far pair
appears at most $(p_2/p_1)^i$ times in expectation at level $i$, from which the
first bound follows. Moreover, since the probability of collisions is monotonic
in the distance, we have that $\Pr[h(x) = h(y)] \le 1$ if $d(x, y) \le r$, and
$\Pr[h(x) = h(y)] \le p_1$ if $r < d(x, y) \le cr$, from which the third and
second bounds follow, respectively.

In order to get the last bound we observe that each entry of $R$ and $S$ is
replicated $L^i = p_1^{-i}$ times at level $i$. Thus, we have that
$N 2^{k+1} L^i$ is the total maximum number of far collisions in subproblems at
level $i$ where the smallest input set has size in $[2^k, 2^{k+1})$. Each one of
these collisions survives up to level $i$ with probability $p_2^i$, and thus the
expected number of these collisions is $N 2^{k+1} (p_2/p_1)^i$. □
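
The three regimes used in the proof can be illustrated numerically. The sketch
below only evaluates the mean $(\Pr[h(x) = h(y)]/p_1)^i$ of the branching
process; the function name `expected_copies` and the values of $p_1$, $p_2$ and
the level are illustrative assumptions, not parameters taken from the paper.

```python
def expected_copies(collision_prob: float, p1: float, level: int) -> float:
    """Expected number of subproblems at `level` containing a fixed pair.

    This is (Pr[h(x) = h(y)] / p1) ** level, the mean population size of a
    Galton-Watson process whose mean offspring number is collision_prob / p1.
    """
    return (collision_prob / p1) ** level


# Illustrative parameters (assumptions for this sketch only).
p1, p2, level = 0.5, 0.25, 10

far = expected_copies(p2, p1, level)      # (p2/p1)^i: decreases geometrically
border = expected_copies(p1, p1, level)   # critical case: stays at 1
near = expected_copies(1.0, p1, level)    # L^i with L = 1/p1: grows geometrically
```

With these values a far pair is expected in about $10^{-3}$ subproblems at level
10, while a pair of identical points is expected in $L^{10} = 1024$ of them.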

We are now ready to prove the I/O complexity of OSimJoin as claimed in
Theorem 1. By the linearity of expectation and Lemma 2, we get that the expected
I/O complexity of OSimJoin is

$$O\left( \frac{N L^{\ell}}{B}
  + \sum_{i=0}^{\ell} \frac{E[C_i(R \bowtie_{\le cr} S)]}{MB}
  + \sum_{i=\ell}^{\Psi-1} \sum_{k=0}^{\log M}
      \frac{E[C_{i,k}(R \bowtie_{>cr} S)]\, L}{B 2^k} \right),$$

where $\ell = \lceil \log_{1/p_2}(N/M) \rceil$. Note that
$C_{i,\log M}(R \bowtie_{>cr} S) \le C_i(R \bowtie_{>cr} S)$, that
$|R \bowtie_{>cr} S| \le N^2$, and that
$C_i(R \bowtie_{\le cr} S) = C_i(R \bowtie_{\le r} S) + C_i(R \bowtie_{>r, \le cr} S)$.
By plugging in the bounds on the expected number of collisions given in Lemma 3,
we get the claimed result.
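
The substitution can be sketched as follows. This is an outline under the
assumption that $p_1$ and $p_2$ are constants with $p_2 < p_1$, so that
$L = 1/p_1$ is a constant and geometric sums are dominated by their largest
term. With $\rho = \log(1/p_1)/\log(1/p_2)$ and the choice of $\ell$ above, we
have $L^{\ell} = O((N/M)^{\rho})$ and
$(p_2/p_1)^{\ell} = p_2^{\ell} L^{\ell} \le (M/N) L^{\ell}$. The three sums then
behave as:

```latex
\begin{align*}
\sum_{i=0}^{\ell} \frac{E[C_i(R \bowtie_{\le cr} S)]}{MB}
  &\le \sum_{i=0}^{\ell}
       \frac{|R \bowtie_{\le r} S|\, L^{i} + |R \bowtie_{>r,\le cr} S|}{MB}
   = O\!\left( \frac{|R \bowtie_{\le r} S|\, L^{\ell}}{MB}
       + \frac{(\ell+1)\, |R \bowtie_{\le cr} S|}{MB} \right), \\
\sum_{i=\ell}^{\Psi-1} \sum_{k=0}^{\log M - 1}
       \frac{E[C_{i,k}(R \bowtie_{>cr} S)]\, L}{B 2^{k}}
  &\le \sum_{i \ge \ell} \frac{2 N L \log M}{B}
       \left( \frac{p_2}{p_1} \right)^{i}
   = O\!\left( \frac{M L^{\ell} \log M}{B} \right), \\
\sum_{i=\ell}^{\Psi-1}
       \frac{E[C_{i,\log M}(R \bowtie_{>cr} S)]\, L}{B M}
  &\le \sum_{i \ge \ell} \frac{N^{2} L}{B M}
       \left( \frac{p_2}{p_1} \right)^{i}
   = O\!\left( \frac{N L^{\ell}}{B} \right).
\end{align*}
```

Adding the term $N L^{\ell}/B$ and using $M \le N$, the total has the form
$(N/M)^{\rho} (N/B + |R \bowtie_{\le r} S|/(MB)) + |R \bowtie_{\le cr} S|/(MB)$
up to the logarithmic factors $\ell + 1$ and $\log M$.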


Analysis of Correctness. The following lemma shows that OSimJoin outputs with
probability $1 - O(1/N)$ all near pairs, as claimed in Theorem 1.

Lemma 4. Let $R, S \subseteq U$ and $|R| + |S| = N$. Executing
$O(\log^{3/2} N)$ independent repetitions of OSimJoin$(R, S, 0)$ outputs
$R \bowtie_{\le r} S$ with probability at least $1 - O(1/N)$.

Proof. We now argue that a pair $(x, y)$ with $d(x, y) \le r$ is output with
probability $\Omega(1/\sqrt{\log N})$. Let $X_i = C_i((x, y))$ be the number of
subproblems at the level $i$ containing $(x, y)$. By applying Galton-Watson
branching process, we get that $E[X_i] = (\Pr[h(x) = h(y)]/p_1)^i$. If
$\Pr[h(x) = h(y)]/p_1 > 1$ then in fact there is positive constant probability
that $(x, y)$ survives indefinitely, i.e., does not go extinct [11]. Since at
every branch of the recursion we eventually compare points that collide under
all hash functions on the path from the root call, this implies that $(x, y)$ is
reported with a positive constant probability.

In the critical case where $\Pr[h(x) = h(y)]/p_1 = 1$ we need to consider the
variance of $X_i$, which by [11, Theorem 5.1] is equal to $i\sigma^2$, where
$\sigma^2$ is the variance of the number of children (hash collisions in
recursive calls). If $1/p_1$ is integer, the number of children in our branching
process follows a binomial distribution with mean 1. This implies that
$\sigma^2 < 1$. Also in the case where $1/p_1$ is not integer, it is easy to see
that the variance is bounded by 2. That is, we have $\mathrm{Var}(X_i) \le 2i$,
which by Chebychev's inequality means that for some integer
$j^* = 2\sqrt{i} + O(1)$:

$$\sum_{j=j^*}^{\infty} \Pr[X_i \ge j]
  \le \sum_{j=j^*}^{\infty} \mathrm{Var}(X_i)/j^2 \le 1/2.$$

Since we have $E[X_i] = \sum_{j=1}^{\infty} \Pr[X_i \ge j] = 1$ then
$\sum_{j=1}^{j^*-1} \Pr[X_i \ge j] \ge 1/2$, and since $\Pr[X_i \ge j]$ is
non-increasing with $j$ this implies that
$\Pr[X_i \ge 1] \ge 1/(2j^*) = \Omega(1/\sqrt{i})$. Furthermore, the recursion
depth $O(\log N)$ implies the probability that a near pair is found is
$\Omega(1/\sqrt{\log N})$. Thus, by repeating $O(\log^{3/2} N)$ times we can
make the error probability $O(1/N^3)$ for a particular pair and $O(1/N)$ for the
entire output by applying the union bound.
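
The last step of the proof is a standard amplification argument, which can be
made concrete as below. The helper `repetitions_needed` is a hypothetical name;
it takes the per-run success probability of a single pair as an input rather
than deriving it, and returns the smallest number of independent runs that
pushes the failure probability of that pair below $N^{-3}$.

```python
import math


def repetitions_needed(n: int, success_prob: float) -> int:
    """Smallest t such that (1 - success_prob) ** t <= n ** -3.

    A union bound over at most n**2 pairs then leaves an overall failure
    probability of at most 1/n.
    """
    if not 0.0 < success_prob < 1.0:
        raise ValueError("success_prob must lie strictly between 0 and 1")
    if n < 2:
        raise ValueError("n must be at least 2")
    # (1 - q)^t <= n^-3  <=>  t >= 3 ln(n) / -ln(1 - q)
    return math.ceil(3.0 * math.log(n) / -math.log1p(-success_prob))
```

If the per-run success probability is of order $1/\sqrt{\log N}$, the returned
value grows as $\log^{3/2} N$, which is the number of repetitions stated in
Lemma 4.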
3.4 Removing duplicates

Given two near points $x$ and $y$, the definition of LSH requires their
collision probability $p(x, y) = \Pr[h(x) = h(y)] \ge p_1$. If
$p(x, y) \gg p_1$, our OSimJoin algorithm can emit $(x, y)$ many times. As an
example suppose that the algorithm ends in one recursive call: then, the pair
$(x, y)$ is expected to be in the same bucket for $p(x, y)L$ iterations of Step
6 and thus it is emitted $p(x, y)L \gg 1$ times in expectation. Moreover, if the
pair is not emitted in the first recursive level, the expected number of emitted
pairs increases as $(p(x, y)L)^i$ since the pair $(x, y)$ is contained in
$(p(x, y)L)^i$ subproblems at the $i$-th recursive level. A simple solution is
to store all emitted near pairs on the external memory, and then use a
cache-oblivious sorting algorithm [9] for removing repetitions.

However, the I/O cost of this approach grows with the expected average
replication of each emitted pair, which can dominate the complexity of
OSimJoin. A similar issue appears in the cache-aware algorithm ASimJoin as well:
a near pair is emitted at most $L' = (N/M)^{\rho}$ times since there is no
recursion and the partitioning of the two input sets is repeated only $L'$
times.

If the collision probability $\Pr[h(x) = h(y)]$ can be explicitly computed in
$O(1)$ time and no I/Os for each pair $(x, y)$, it is possible to emit each near
pair once in expectation without storing near pairs on the external memory. We
note that the collision probability can be computed for many metrics, including
Hamming [14], $\ell_1$ and $\ell_2$ [7], Jaccard [4], and angular [5] distances.
For the cache-oblivious algorithm, the approach is the following: for each near
pair $(x, y)$ that is found at the $i$-th recursive level, with $i \ge 0$, the
pair is emitted with probability $1/(p(x, y)L)^i$; otherwise, we ignore it. For
the cache-aware algorithm, the idea is the same but a near pair is emitted with
probability $1/(p(x, y)L')$ with $L' = (N/M)^{\rho}$.
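
The emission rule for the cache-oblivious case can be sketched as follows. All
function names are hypothetical. The collision probability used in the example
is that of single-bit sampling for Hamming distance, $1 - d(x, y)/D$ on bit
strings of length $D$; this family is an assumption made for illustration, and
other LSH families would supply their own formula for $p(x, y)$.

```python
import random


def bit_sampling_collision_prob(x: str, y: str) -> float:
    """Collision probability of single-bit sampling: 1 - hamming(x, y) / D."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty bit strings of equal length")
    hamming = sum(a != b for a, b in zip(x, y))
    return 1.0 - hamming / len(x)


def emit_probability(p_xy: float, L: float, level: int) -> float:
    """Probability 1 / (p(x, y) * L) ** level of emitting a pair found at `level`.

    Since p(x, y) >= p1 = 1/L for near pairs, the result is at most 1.
    """
    return 1.0 / (p_xy * L) ** level


def maybe_emit(pair, p_xy: float, L: float, level: int, rng=random):
    """Return `pair` with the probability above, otherwise None."""
    return pair if rng.random() < emit_probability(p_xy, L, level) else None
```

A pair found at level 0 is always emitted, while a pair found deeper in the
recursion is thinned out in proportion to the expected number of subproblems
that contain it.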

Theorem 4. The above approaches guarantee that each near pair is emitted 
with constant probability in both ASimJoin and OSimJoin. 

Proof. The claim easily follows for the cache-aware algorithm: indeed the two
points of a near pair $(x, y)$ have the same hash value in $p(x, y)L'$ (in
expectation) of the $L' = (N/M)^{\rho}$ repetitions of Step 3. Therefore, by
emitting the pair with probability $1/(p(x, y)L')$ we get the claim.

We now focus on the cache-oblivious algorithm, where the claim requires a more
articulated proof. Given a near pair $(x, y)$, let $G_i$ and $H_i$ be random
variables denoting respectively the number of subproblems at level $i$
containing the pair $(x, y)$, and the number of subproblems at level $i$ where
$(x, y)$ is not found by the cache-oblivious nested loop join algorithm in
Theorem 2. Let also $K_i$ be a random variable denoting the actual number of
times the pair $(x, y)$ is emitted at level $i$. We have the following
properties:

1. $E[K_i \mid G_i, H_i] = (G_i - H_i)/(p(x, y)L)^i$ since a near pair is
   emitted with probability $1/(p(x, y)L)^i$ only in those subproblems where the
   pair is found by the join algorithm.

2. $E[G_i] = (p(x, y)L)^i$ since a near pair is in the same bucket with
   probability $p(x, y)^i$ (it follows from the previous analysis based on
   standard branching).

3. $G_0 = 1$ since each pair exists at the beginning of the algorithm.

4. $H_{\Psi} = 0$ since each pair surviving up to the last recursive level is
   found by the nested loop join algorithm.


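We are interested in upper bounding $E\left[\sum_{i=0}^{\Psi} K_i\right]$. We
prove by induction that

$$E\left[\sum_{i=0}^{l} K_i\right] = 1 - \frac{E[H_l]}{(p(x, y)L)^l}$$

for any $0 \le l \le \Psi$. For $l = 0$ (i.e., the first call to OSimJoin),
noting that $E[G_0] = G_0 = 1$, the equality is verified since

$$E[K_0] = E[E[K_0 \mid G_0, H_0]] = E[G_0 - H_0] = 1 - E[H_0].$$

We now consider a generic level $l > 0$. Since a pair propagates in a lower
recursive level with probability $p(x, y)$, we have

$$E[G_l] = E[E[G_l \mid H_{l-1}]] = p(x, y)L\, E[H_{l-1}].$$

Thus,

$$E[K_l] = E[E[K_l \mid G_l, H_l]]
  = E\left[\frac{G_l - H_l}{(p(x, y)L)^l}\right]
  = \frac{E[H_{l-1}]}{(p(x, y)L)^{l-1}} - \frac{E[H_l]}{(p(x, y)L)^l}.$$

By exploiting the inductive hypothesis, we get

$$E\left[\sum_{i=0}^{l} K_i\right]
  = E[K_l] + E\left[\sum_{i=0}^{l-1} K_i\right]
  = 1 - \frac{E[H_l]}{(p(x, y)L)^l}.$$

Since $H_{\Psi} = 0$, we have $E\left[\sum_{i=0}^{\Psi} K_i\right] = 1$ and the
claim follows. □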
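
The telescoping structure of the induction can be checked with exact
arithmetic. The sketch below is a simplified model, not the algorithm itself: it
assumes that at each level a fixed, user-supplied fraction of the subproblems
containing the pair finds it (the hypothetical parameter `found_fraction`), and
it tracks only expectations using the recurrences from the proof.

```python
from fractions import Fraction


def expected_emissions(p_xy, L, found_fraction):
    """Expected total number of emissions of one near pair.

    found_fraction[i] is the assumed fraction of level-i subproblems holding
    the pair in which the nested loop join finds it; the last entry must be 1,
    which models H_Psi = 0.
    """
    if Fraction(found_fraction[-1]) != 1:
        raise ValueError("the pair must always be found at the last level")
    pL = Fraction(p_xy) * L
    exp_G = Fraction(1)                      # E[G_0] = 1
    total = Fraction(0)
    for i, q in enumerate(found_fraction):
        exp_H = (1 - Fraction(q)) * exp_G    # E[H_i]: copies where it is missed
        total += (exp_G - exp_H) / pL ** i   # E[K_i] = E[G_i - H_i] / (pL)^i
        exp_G = pL * exp_H                   # E[G_{i+1}] = p L E[H_i]
    return total
```

Whatever fractions are supplied for the intermediate levels, the total equals
one, which mirrors the statement $E[\sum_i K_i] = 1$.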


We observe that the proposed approach is equivalent to using an LSH where
$p(x, y) = p_1$ for each near pair. Finally, we remark that this approach does
not avoid replicas of the same near pair when the algorithm is repeated for
increasing the collision probability of near pairs. Thus, the probability of
emitting a pair is at least $\Omega(1/\sqrt{\log N})$ as shown in the second
part of Section 3.3 and $O(\log^{3/2} N)$ repetitions of OSimJoin suffice to
find all pairs with high probability (however, the expected number of replicas
of a given near pair becomes $O(\log^{3/2} N)$, even with the proposed
approach).

4 Discussion

We will argue informally that our I/O complexity of Theorem 1 is close to the
optimal. For simple arguments, we split the I/O complexity of our algorithms in
two parts:

$$T_1 = \left(\frac{N}{M}\right)^{\rho}
        \left( \frac{N}{B} + \frac{|R \bowtie_{\le r} S|}{MB} \right),
  \qquad
  T_2 = \frac{|R \bowtie_{\le cr} S|}{MB}.$$


We now argue that $T_1$ I/Os are necessary. First, notice that we need
$O(N/B)$ I/Os per hash function for transferring data between memories,
computing and writing hash values to disk to find collisions. Second, since each
I/O brings at most $B$ points in order to compute the distance with $M$ points
residing in the internal memory, we need $N/B$ I/Os to examine $MN$ pairs. This
means that when the collision probability of far pairs is $p_2 \le M/N$ and the
number of collisions of far pairs is at most $MN$ in expectation, we only need
$O(N/B)$ I/Os to detect such far pairs. Now we consider the case where there are
$\Omega(N^2)$ pairs at distance $cr$. Due to the monotonicity of the LSH family,
the collision probability for each such pair must be $O(M/N)$ to ensure that
$O(N/B)$ I/Os suffice to examine such pairs. In turn, this means that the
collision probability for near pairs within distance $r$ must be at most
$O((M/N)^{\rho})$. So we need $\Omega((N/M)^{\rho})$ repetitions (different hash
functions) to expect at least one collision for any near pair.

Then, a worst-case data set can be given so that we might need to examine, for
each of the $\Omega((N/M)^{\rho})$ hash functions, a constant fraction of the
pairs in $R \bowtie_{\le r} S$ whose collision probability is constant. For
example, this can happen if $R$ and $S$ include two clusters of very near
points. One could speculate that some pairs could be marked as "finished" during
computation such that we do not have to compute their distances again. However,
it seems hard to make this idea work for an arbitrary distance measure where
there may be very little structure for the output set, hence the
$O(|R \bowtie_{\le r} S|/(MB))$ additional I/Os per repetition are needed.

In order to argue that the term $T_2$ is needed, we consider the case where all
pairs in $R \bowtie_{\le cr} S$ have distance $r + \varepsilon$ for a value
$\varepsilon$ small enough to make the collision probability of a pair at
distance $r + \varepsilon$ indistinguishable from the collision probability of a
pair at distance $r$. Then every pair in $R \bowtie_{\le cr} S$ must be brought
into the internal memory to ensure the correct result, which requires $T_2$
I/Os. This holds for any algorithm enumerating or listing the near pairs.
Therefore, there does not exist an algorithm that beats the quadratic dependency
on $N$ for such worst-case input sets, unless the distribution of the input is
known beforehand. However, when $|R \bowtie_{\le cr} S|$ is subquadratic in $N$,
a potential approach to achieve subquadratic dependency in expectation for the
similarity join problem is filtering invalid pairs based on their distances;
currently the LSH-based method is the only way to do this.

[Figure 1: panel (a) Enron email dataset, Jaccard and cosine similarity; panel
(b) MNIST dataset, L1 and L2 distance]

Fig. 1. The cumulative distributions of pairwise similarities and pairwise
distances on samples of 10,000 points from Enron Email and MNIST datasets. We
note that values decrease on the x-axis of Figure 1.a, while they increase in
Figure 1.b.


Note that when $M = N$ our I/O cost is $O(N/B)$ as we would expect, since just
reading the input is optimal. At the other extreme, when $B = M = 1$ our bound
matches the time complexity of internal memory techniques. When
$|R \bowtie_{\le cr} S|$ is bounded by $MN$ then our algorithm achieves
subquadratic dependency on $N/M$. Such an assumption is realistic in some
real-world datasets as shown in the experimental evaluation section.

To complement the above discussion we will evaluate our complexity by computing
explicit constants and then evaluating the total number of I/Os spent by
analyzing real datasets. Performing these "simulated experiments" has the
advantage over real experiments that we are not impacted by any properties of a
physical machine. We again split the I/O complexity of our algorithms in two
parts:

$$T_1 = \left(\frac{N}{M}\right)^{\rho}
        \left( \frac{N}{B} + \frac{|R \bowtie_{\le r} S|}{MB} \right),
  \qquad
  T_2 = \frac{|R \bowtie_{\le cr} S|}{MB},$$

and carry out experiments to demonstrate that the first term $T_1$ often
dominates the second term $T_2$ in real datasets. In particular, we depict the
cumulative distribution function (cdf) in log-log scale of all pairwise
distances (i.e., $\ell_1$, $\ell_2$) and all pairwise similarities (i.e.,
Jaccard and cosine) on two commonly used datasets: Enron Email‡ and MNIST§, as
shown in Figure 1. Since the Enron data set does not have a fixed data size per
point, we consider a version of the data set where the dimension has been
reduced such that each vector has a fixed size.

‡ https://archive.ics.uci.edu/ml/datasets/Bag+of+Words
§ http://yann.lecun.com/exdb/mnist/

Fig. 2. A comparison of I/O cost for similarity joins on the standard LSH, nested
loop and ASimJoin algorithms.


Figure 1.a shows an inverse polynomial relationship with a small exponent $m$
between similarity threshold $s$ and the number of pairwise similarities greater
than $s$. The degree of the polynomial is particularly low when $s > 0.5$. This
setting $s > 0.5$ is commonly used in many applications for both Jaccard and
cosine similarities [2,3,19]. Similarly, Figure 1.b also shows a monomial
relationship between the distance threshold $r$ and the number of pairwise
distances smaller than $r$. In turn, this means that the number of $c$-near
pairs $|R \bowtie_{\le cr} S|$ is not much greater than
$c^m |R \bowtie_{\le r} S|$. In other words, the second term $T_2$ is often much
smaller than the first term $T_1$.

Finally, for the same data sets and metrics, we simulated the cache-aware
algorithm with explicit constants and examined the I/O cost to compare with a
standard nested loop method (Section 2) and a lower bound on the standard LSH
method (Section 2). We set the cache size $M = N/1000$, which is reasonable for
judging the number of cache misses since the size ratio between CPU caches and
RAM is in that order of magnitude. In general such a setting allows us to
investigate what happens when the data size is much larger than fast memory. For
simplicity we use $B = 1$ since all methods contain a multiplicative factor
$1/B$ on the I/O complexity. The values of $\rho$ were computed using good LSH
families for the specific metric and parameters $r$ and $c$. These parameters
are picked according to Figure 1 such that the number of $c$-near pairs is only
an order of magnitude larger than the number of near pairs.

The I/O complexity used for nested loop join is $2N + N^2/(MB)$ (here we assume
both sets have size $N$) and the complexity for the standard LSH approach is
lower bounded by the cost of sorting the hashed points. This complexity is a
lower bound on the standard sorting based approaches as it lacks the additional
cost that depends on how LSH distributes the points. Since $M = N/1000$ we can
bound the log-factor of the sorting complexity and use sort$(N) \le 8N$ since
$2N$ points are read and written twice. The I/O complexity of our approach is
stated in Theorem 3. The computed I/O values in Figure 2 show that the
complexity of our algorithm is lower than that of the other methods in all
instances examined. Nested loop suffers from quadratic dependency on $N$, while
the standard LSH bounds lack the dependency on $M$. Overall the I/O cost
indicates that our cache-aware algorithm is practical on the examined data sets.
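
A simulated comparison of this kind can be sketched as below. This is not the
computation behind Figure 2: the explicit constants used there are not
reproduced, the nested loop cost is assumed to have the form $2N + N^2/(MB)$,
and the function name and all numeric inputs are hypothetical.

```python
def simulated_io_costs(n, m, b, rho, near_pairs, c_near_pairs):
    """Constant-free I/O estimates for our bound and for nested loop join.

    T1 = (N/M)^rho * (N/B + |near pairs| / (M B)), T2 = |c-near pairs| / (M B).
    """
    t1 = (n / m) ** rho * (n / b + near_pairs / (m * b))
    t2 = c_near_pairs / (m * b)
    nested_loop = 2 * n + n * n / (m * b)   # assumed form, both sets of size N
    return {"T1": t1, "T2": t2, "ours": t1 + t2, "nested_loop": nested_loop}


# Illustrative inputs (assumptions): M = N/1000, B = 1, rho = 0.5, and ten
# times as many c-near pairs as near pairs.
costs = simulated_io_costs(n=10**6, m=10**3, b=1, rho=0.5,
                           near_pairs=10**6, c_near_pairs=10**7)
```

With these inputs the term `T1` is several orders of magnitude larger than `T2`,
and the combined estimate stays well below the nested loop estimate.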
5 Conclusion


In this paper we examine the problem of computing the similarity join of two
relations in an external memory setting. Our new cache-aware algorithm of
Section 3.1 and cache-oblivious algorithm of Section 3.2 improve upon the
current state of the art by around a factor of $(M/B)^{\rho}$ I/Os unless the
number of $c$-near pairs is huge (more than $NM$). We believe this is the first
cache-oblivious algorithm for similarity join, and more importantly the first
subquadratic algorithm whose I/O performance improves significantly when the
size of internal memory grows.

It would be interesting to investigate if our cache-oblivious approach is also
practical; this might require adjusting parameters such as $L$. Our I/O bound
is probably not easy to improve significantly, but interesting open problems
are to remove the error probability of the algorithm and to improve the implicit 
dependence on dimension in B and M. Note that our work assumes for simplicity 
that the unit of M and B is number of points, but in general we may get tighter 
bounds by taking into account the gap between the space required to store a point 
and the space for hash values. Also, the result in this paper is made with general 
spaces in mind and it is an interesting direction to examine if the dependence 
on dimension could be made explicit and improved in specific spaces. 